
AI-assisted video annotation tool

Image and video annotation tools let users label or tag objects, people, text, and other elements within visual media. These labels form the training datasets essential for machine learning models, computer vision applications, and other AI systems that rely on annotated data for learning and prediction.
Existing annotation tools are limited in their features, capability, or accessibility. Some require manual selection of objects within an image, forcing users to draw the outline of each object one by one. Newer tools automate the selection process with edge-detection algorithms or AI models, but are limited in their ability to handle video files with multiple frames. More capable software tools are often paid, licensed, and closed source, intended for commercial use with limited access for students and researchers. The goal of this project was to create a simple, intuitive video annotation tool that addresses these limitations and shortcomings.
Using labelme, a Python-based open-source annotation tool, as a base, we integrated "Segment Anything," an AI segmentation model from Meta AI, adding automatic object selection and an inter-frame segmentation feature for images and video frames. This greatly streamlines the annotation process while remaining free and accessible to anyone.
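As a rough illustration of the integration, the sketch below turns a single point click into a mask using the public segment-anything Python package. The checkpoint file name, model size, and frame path are assumptions for illustration, not the exact configuration used in this project.

```python
# Sketch: point-prompted segmentation with Meta AI's Segment Anything.
# Assumes the `segment-anything` package is installed and a ViT-B
# checkpoint has been downloaded (both are assumptions, not the exact
# setup used in this project).
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the model and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Read one frame and hand it to the predictor (it expects RGB input).
frame = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A single foreground click at (x, y); label 1 = foreground, 0 = background.
point = np.array([[320, 240]])
label = np.array([1])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,  # SAM returns up to three candidate masks
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring mask
```

The same predict call also accepts a bounding-box prompt, which is how box-based selection can be supported alongside point clicks.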
An example of a segmentation model's output for an image: the model highlights every object it finds, and each highlighted region must be processed and converted into an object outline and point coordinates. The resulting data can be used for object selection based on simple user inputs, and the final annotation data is saved in JSON format.
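A minimal sketch of that post-processing step is shown below, assuming OpenCV for contour extraction. The JSON field names approximate the labelme format rather than reproduce the project's exact output.

```python
# Sketch: convert a binary mask into a polygon outline and save it in a
# labelme-style JSON record. Keys and values are illustrative assumptions.
import json
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, epsilon: float = 2.0) -> list:
    """Return the largest contour of a binary mask as [[x, y], ...]."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    largest = max(contours, key=cv2.contourArea)
    simplified = cv2.approxPolyDP(largest, epsilon, True)  # simplify outline
    return simplified.reshape(-1, 2).tolist()

def save_annotation(mask: np.ndarray, label: str, image_path: str, out_path: str):
    record = {
        "imagePath": image_path,
        "shapes": [{
            "label": label,
            "points": mask_to_polygon(mask),
            "shape_type": "polygon",
        }],
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
```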
Different models produce different levels of object segmentation. For annotation, a model that could segment any arbitrary type of object and scene was required. Meta AI's Segment Anything had just been released at the time of this project and outperformed most existing models.
The main milestones and challenges of this project were: 
1. Research methods of highlighting and isolating objects within arbitrary images and videos, using conventional image-processing algorithms (edge detection, depth-based methods) or AI-based approaches (segmentation models); a small sketch of the conventional baseline follows this list.
2. Implement the segmentation feature within a graphical user interface.
3. Allow simple user inputs to drive the annotation process, optimizing the workflow for images as well as videos.
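For milestone 1, a conventional baseline might look like the Canny edge-detection sketch below; the threshold values are illustrative assumptions, not tuned settings from this project.

```python
# Sketch of the conventional approach explored in milestone 1: Canny edge
# detection with OpenCV. Thresholds and file names are assumptions.
import cv2

image = cv2.imread("frame_0001.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(image, (5, 5), 0)        # suppress noise first
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
cv2.imwrite("edges.png", edges)
```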
In the end, the tool achieved the following:
Intuitive User Interface: Object selection and annotation are simplified by the AI segmentation model; the user can highlight an object to be annotated with single or multiple points, or with a bounding box.
Support for multiple frames of videos: Rather than annotating an object frame by frame, the annotation from the previous frame can be used to automatically generate the one for the following frame (see the sketch after this list).
Free and Open Source: Our implementation is built on top of labelme, an existing open-source tool written in Python that is free for anyone to use. Anyone can view, edit, and further improve the code.
Framework for future improvement: There is potential for further enhancements to the annotation process, such as advanced annotation methods (semantic segmentation, temporal annotation), integration with cloud-based storage solutions, and performance optimizations.
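As a hedged sketch of how the inter-frame support mentioned above could work, the snippet below re-prompts Segment Anything on the next frame with the centroid and bounding box of the previous frame's mask. This centroid-plus-box prompting is an assumption about one possible propagation scheme, not the tool's exact algorithm.

```python
# Sketch: propagate an annotation to the next video frame by re-prompting
# Segment Anything with cues derived from the previous frame's mask.
import numpy as np
from segment_anything import SamPredictor

def propagate_mask(predictor: SamPredictor, next_frame: np.ndarray,
                   prev_mask: np.ndarray) -> np.ndarray:
    """Seed SAM on `next_frame` (RGB) using the previous frame's binary mask."""
    ys, xs = np.nonzero(prev_mask)
    centroid = np.array([[xs.mean(), ys.mean()]])             # (x, y) point prompt
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY box prompt

    predictor.set_image(next_frame)
    masks, scores, _ = predictor.predict(
        point_coords=centroid,
        point_labels=np.array([1]),  # 1 = foreground
        box=box,
        multimask_output=False,
    )
    return masks[0]
```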
For more details on the approach, implementation, and results, see the accompanying paper.



Source code available on GitHub